-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use bloom filter for evaluating dynamic filters on strings #24528
base: master
Are you sure you want to change the base?
Conversation
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
137c6e9
to
0edc826
Compare
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
bloom = new long[bloomSize]; | ||
bloomSizeMask = bloomSize - 1; | ||
for (Slice value : values) { | ||
long hashCode = XxHash64.hash(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slice has a hashCode that is using XxHash64 already (and is memoized). Just value.hashCode()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These Slices are temporary objects that are created from a single contiguous Block, depending on the Type the Slice may be subject to truncation and padding as well.
So I don't think we gain anything by memoized hash code.
On the other hand, the hashing logic for bloom filter could evolve to be different from Slice's hashCode implementation.
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Outdated
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/gen/columnar/BloomFilter.java
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/planner/DomainTranslator.java
Show resolved
Hide resolved
core/trino-main/src/main/java/io/trino/sql/planner/DomainTranslator.java
Show resolved
Hide resolved
Could you please add a high-level description about where the oprimizations proposed in this PR would apply. |
13b8ccd
to
d8b44ff
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great generally
for (Slice value : values) { | ||
long hashCode = XxHash64.hash(value); | ||
// Set 3 bits in a 64 bit word | ||
bloom[bloomIndex(hashCode)] |= bloomMask(hashCode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider using an open hash table of xxhash codes instead of the bloom filter? This could trade some performance for more accuracy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to use this eventually for collecting and evaluation dynamic filters with millions of distinct values, so I want the trade-offs to be in favor of using less memory and CPU
List<Supplier<FilterEvaluator>> subExpressionEvaluators = currentPredicate.getDomains().orElseThrow() | ||
.entrySet().stream() | ||
.map(entry -> { | ||
if (canUseBloomFilter(entry.getValue())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea. Potentially we could use a less accurate bloom filter (limited in size) for a dynamic filter with too many values for a normal filter if the accuracy is worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For smaller DF we should stick with more accurate implementation.
@@ -0,0 +1,181 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly we replace the current implementation that uses ObjectOpenCustomHashSet for the bloom filter. That trades the accuracy of the filter for performance. Could you make that explicit in the commit message?
Do you have an estimate of this bloom filter accuracy? Looks like it is pretty good given the size of the filter i.e. only conflicts matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In io.trino.sql.gen.TestDynamicPageFilter#testSliceBloomFilter
there is an assertion which checks that accuracy for a filter with 0.1
selectivity is between (0.1, 0.115)
. It's a bit less accurate than the more canonical bloom filter implementations in orc and parquet, but it's significantly faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the commit message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DF selectivity threshold should not be lowered by 0.05 then?
ffdebcc
to
b377b55
Compare
@raunaqmorarka did you run benchmarks? (TPCH) |
Joins in TPC benchmarks are mostly on bigints, so this doesn't matter there, I'll run something manually for that |
0e6b146
to
8a284d5
Compare
Improves efficiency of evaluating dynamic filters on strings with the potential for some false positives compared to exsitng approach BenchmarkDynamicPageFilter.filterPages (filterSize) (inputDataSet) (inputNullChance) (nonNullsSelectivity) (nullsAllowed) Mode Cnt Before Score After Score Units 2 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 80.908 ± 1.927 172.244 ± 1.067 ops/s 5 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 81.052 ± 2.569 175.619 ± 1.225 ops/s 10 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 76.787 ± 1.561 176.371 ± 0.559 ops/s 100 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 75.631 ± 1.372 174.288 ± 1.024 ops/s 1000 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 69.615 ± 0.721 173.340 ± 0.867 ops/s 10000 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 75.401 ± 1.233 173.285 ± 1.752 ops/s 100000 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 64.335 ± 2.936 170.087 ± 1.370 ops/s 1000000 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 16.808 ± 3.205 170.403 ± 1.471 ops/s 5000000 VARCHAR_RANDOM 0.01 0.2 false thrpt 10 15.766 ± 0.820 150.588 ± 4.034 ops/s
8a284d5
to
a859c01
Compare
Description
Additional context and related issues
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: